Twitter text normalization based on unsupervised learning algorithm
DENG Jiayuan, JI Donghong, FEI Chaoqun, REN Yafeng
Journal of Computer Applications    2016, 36 (7): 1887-1892.   DOI: 10.11772/j.issn.1001-9081.2016.07.1887
Abstract
Twitter messages contain a large number of nonstandard tokens, created either intentionally or unintentionally by users. Normalizing these tokens is crucial for many natural language processing applications. Because existing normalization systems perform poorly, a novel unsupervised normalization system was proposed. First, a standard dictionary was used to determine whether a tweet needs to be normalized. Second, each nonstandard token was assigned to 1-to-1 or 1-to-N recovery based on its characteristics. For 1-to-N recovery, the nonstandard token was divided into multiple possible words using forward and backward search. Third, normalization candidates were generated for the nonstandard tokens among these possible words by integrating a random walk with a spelling checker. Finally, the best normalized tweet was obtained by scoring all candidates with an n-gram language model. Experimental results on a manually annotated dataset show that the proposed approach achieves an F-score of 86.4%, which is 10 percentage points higher than that of the current best graph-based random walk algorithm.
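The four steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the algorithms in detail, so everything here is an assumption. The toy `DICTIONARY` and `CORPUS`, all function names, the greedy longest-match reading of "forward and backward search", and the length-normalized add-one bigram score are hypothetical. The random walk is not reproduced; the standard library's `difflib.get_close_matches` stands in for the random walk plus spelling checker as a simple candidate generator.

```python
import math
from collections import Counter
from difflib import get_close_matches
from itertools import product

# Hypothetical toy resources; a real system would use a full dictionary and corpus.
DICTIONARY = {"see", "you", "to", "tomorrow", "night", "good",
              "at", "the", "party", "i", "will"}
CORPUS = ["i will see you tomorrow", "see you at the party",
          "good night", "see you tomorrow night"]

def forward_split(token, vocab):
    """Greedy longest-match segmentation from the left; None if it fails."""
    words, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                words.append(token[i:j])
                i = j
                break
        else:
            return None
    return words

def backward_split(token, vocab):
    """Greedy longest-match segmentation from the right; None if it fails."""
    words, j = [], len(token)
    while j > 0:
        for i in range(j):
            if token[i:j] in vocab:
                words.insert(0, token[i:j])
                j = i
                break
        else:
            return None
    return words

def candidates(token, vocab):
    """Step 1: dictionary check. Steps 2-3: 1-to-1 and 1-to-N candidates."""
    if token in vocab:
        return [[token]]
    # Stand-in for random walk + spelling checker (assumption, see lead-in).
    options = [[w] for w in get_close_matches(token, sorted(vocab), n=3, cutoff=0.6)]
    for split in (forward_split(token, vocab), backward_split(token, vocab)):
        if split and split not in options:
            options.append(split)
    return options or [[token]]

def train_bigram(corpus):
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        words = ["<s>"] + line.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(words, unigrams, bigrams):
    """Average add-one-smoothed bigram log-probability per word.
    Length normalization is an assumption, so splits are not penalized."""
    padded = ["<s>"] + words
    total = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams)))
                for a, b in zip(padded, padded[1:]))
    return total / len(words)

def normalize(tweet, vocab, unigrams, bigrams):
    """Step 4: pick the candidate combination the language model prefers."""
    per_token = [candidates(t, vocab) for t in tweet.lower().split()]
    best = max(product(*per_token),
               key=lambda combo: score([w for part in combo for w in part],
                                       unigrams, bigrams))
    return " ".join(w for part in best for w in part)

UNIGRAMS, BIGRAMS = train_bigram(CORPUS)
print(normalize("seeyou tomorow", DICTIONARY, UNIGRAMS, BIGRAMS))
```

Enumerating every combination with `product` is only practical for short inputs; a longer tweet would call for a beam or Viterbi-style search over the same candidates.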